Question 1 options:

Suppose that you observe the following 6 episodes from a Markov Decision Process with three states A, B, and C, with a single action for every state. We use no discounting (i.e. γ = 1):

A, 0, C, 0, A, 0, C, 1;
A, 0, C, 1;
B, 0, C, 0;
B, 0, C, 0, A, 0, C, 0;
C, 1;
C, 1.

Suppose you run batch updating with first-visit Monte-Carlo learning on those episodes to learn the value function V. What result do you get for (write your answer as a fraction with no common divisors for the numerator and denominator):

1) V(A)
2) V(B)
3) V(C)

Suppose you run batch updating with every-visit Monte-Carlo learning on those episodes to learn the value function V. What result will this algorithm converge to for:

4) V(A)
5) V(B)
6) V(C)

Suppose you run batch updating with Temporal-Difference learning on those episodes to learn the value function V. What result will this algorithm converge to for:

7) V(A)
8) V(B)
9) V(C)

Knowing that we are dealing with an MDP, and now that you have estimated the state value function in three different ways, what do you think is the best estimate for V?

10) V(A)
11) V(B)
12) V(C)
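
For concreteness, here is a minimal Python sketch of how the batch estimates could be computed on these episodes. The representation of each episode as a list of (state, reward) pairs, the helper names monte_carlo_values and batch_td_values, and the step-size/sweep-count choices are assumptions made for illustration only, not part of the question.

# A minimal sketch, assuming each episode is a list of (state, reward) pairs,
# where the reward is the one received on leaving that state.
from fractions import Fraction

episodes = [
    [("A", 0), ("C", 0), ("A", 0), ("C", 1)],
    [("A", 0), ("C", 1)],
    [("B", 0), ("C", 0)],
    [("B", 0), ("C", 0), ("A", 0), ("C", 0)],
    [("C", 1)],
    [("C", 1)],
]

def monte_carlo_values(episodes, first_visit=True):
    """Batch Monte-Carlo estimate: average the observed returns per state."""
    returns = {}  # state -> list of undiscounted returns following visits
    for episode in episodes:
        seen = set()
        for t, (state, _) in enumerate(episode):
            if first_visit and state in seen:
                continue
            seen.add(state)
            # gamma = 1, so the return is just the sum of rewards from t onward.
            g = sum(r for _, r in episode[t:])
            returns.setdefault(state, []).append(Fraction(g))
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

def batch_td_values(episodes, alpha=0.01, sweeps=20000):
    """Batch TD(0): accumulate TD increments over the whole batch of episodes,
    apply them once per sweep, and repeat until the values settle."""
    states = {s for ep in episodes for s, _ in ep}
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        delta = {s: 0.0 for s in states}
        for episode in episodes:
            for t, (state, reward) in enumerate(episode):
                # The next state's value counts as 0 when the episode terminates.
                v_next = V[episode[t + 1][0]] if t + 1 < len(episode) else 0.0
                delta[state] += reward + v_next - V[state]
        for s in states:
            V[s] += alpha * delta[s]
    return V

print("first-visit MC:", monte_carlo_values(episodes, first_visit=True))
print("every-visit MC:", monte_carlo_values(episodes, first_visit=False))
print("batch TD(0):   ", batch_td_values(episodes))

The Monte-Carlo estimates come out as exact fractions; the batch TD(0) values are approximated numerically with a small constant step size, which (for a sufficiently small step) settles toward the certainty-equivalence estimate for the maximum-likelihood MDP implied by the episodes.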